Add Mistral Small 4 (119B MoE) support via mistral4.py#1037

Open
ProducerGuy wants to merge 1 commit into ml-explore:main from ProducerGuy:mistral-small-4-moe-support

Conversation

@ProducerGuy

Summary

Adds support for Mistral Small 4 (mistralai/Mistral-Small-4-119B-2603), a 119B-parameter Mixture-of-Experts model with 128 experts and 4 active per token (6B active parameters).

Enables mlx-community/Mistral-Small-4-119B-2603-4bit to load and run.

Before

ValueError: Received 1260 parameters not in model:
language_model.model.layers.0.mlp.gate.weight,
language_model.model.layers.0.mlp.shared_experts.down_proj.weight,
language_model.model.layers.0.mlp.switch_mlp.down_proj.weight,
language_model.model.layers.0.self_attn.kv_a_proj_with_mqa.weight,
...

After

Prompt: 7 tokens, 2.597 tokens-per-sec
Generation: 20 tokens, 105.054 tokens-per-sec
Peak memory: 67.088 GB

Changes

New file: mlx_lm/models/mistral4.py

  • MoE feedforward with SwitchGLU routing (128 experts, top-4 selection)
  • Shared expert support via standard MLP
  • MLA (Multi-head Latent Attention) with explicit kv_b_proj linear layer for KV decompression — this is architecturally distinct from DeepSeek V3's MultiLinear approach; Mistral Small 4 uses a single linear projection rather than per-head Kronecker-style decomposition
  • Standard attention fallback for any dense layers
  • All dimensions (kv_lora_rank, q_lora_rank, qk_rope_head_dim, v_head_dim, etc.) read from config.json, nothing hardcoded
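As a rough illustration of the routing described above, a top-k MoE forward pass with a shared expert can be sketched in plain NumPy. This is not the mlx-lm SwitchGLU implementation; the expert count, dimensions, and the identity shared expert are invented for the example:

```python
# Illustrative top-k MoE routing sketch (NOT the actual mlx-lm code).
# The real model uses 128 experts with top-4 selection; we use a tiny
# configuration here so the example runs instantly.
import numpy as np

def moe_forward(x, gate_w, experts, shared_expert, top_k=4):
    """x: (hidden,); gate_w: (n_experts, hidden); experts: callables."""
    scores = gate_w @ x                      # router logits, one per expert
    top = np.argsort(scores)[-top_k:]        # indices of the top-k experts
    weights = np.exp(scores[top])
    weights /= weights.sum()                 # softmax over the selected experts
    out = sum(w * experts[i](x) for w, i in zip(weights, top))
    return out + shared_expert(x)            # shared expert always contributes

rng = np.random.default_rng(0)
hidden, n_experts = 8, 16
x = rng.standard_normal(hidden)
gate_w = rng.standard_normal((n_experts, hidden))
# Each expert captures its own weight matrix via the default argument.
experts = [lambda v, W=rng.standard_normal((hidden, hidden)): W @ v
           for _ in range(n_experts)]
shared = lambda v: v                         # identity stand-in for the shared MLP
y = moe_forward(x, gate_w, experts, shared)
print(y.shape)  # (8,)
```

Only top_k expert MLPs run per token, which is why a 119B-parameter model has roughly 6B active parameters per forward pass.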
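The MLA decompression path described above can also be sketched with NumPy: hidden states are compressed to a small latent (which is what gets cached), and a single kv_b_proj linear expands the latent back into per-head keys and values. All dimensions below are invented for illustration, since the real values come from config.json, and this is not the actual mistral4.py code:

```python
# Hedged sketch of MLA KV compression/decompression via a single
# kv_b_proj linear layer. Dimensions are illustrative only.
import numpy as np

rng = np.random.default_rng(1)
hidden, kv_lora_rank = 32, 8
n_heads, qk_nope_head_dim, v_head_dim = 4, 6, 6

h = rng.standard_normal(hidden)                     # one token's hidden state
kv_a = rng.standard_normal((kv_lora_rank, hidden))  # compress: hidden -> latent
kv_b = rng.standard_normal((n_heads * (qk_nope_head_dim + v_head_dim),
                            kv_lora_rank))          # decompress in one linear

latent = kv_a @ h                                   # cached instead of full K/V
kv = (kv_b @ latent).reshape(n_heads, qk_nope_head_dim + v_head_dim)
k_nope, v = kv[:, :qk_nope_head_dim], kv[:, qk_nope_head_dim:]
print(k_nope.shape, v.shape)  # (4, 6) (4, 6)
```

Caching the latent rather than the full keys and values is what makes MLA memory-efficient at inference time.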

Modified: mlx_lm/models/mistral3.py (9 lines added, 2 removed)

  • Routes to mistral4.Model when n_routed_experts is present in text_config
  • Structural detection (not model_type string matching) — forward-compatible with future MoE Mistral variants
  • Existing dense Ministral 3B/8B/14B models completely unaffected
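A minimal sketch of the structural detection, with a hypothetical helper name (the actual mistral3.py code differs):

```python
# Hypothetical illustration of structural routing; the function name
# and config dicts are made up, not taken from mlx-lm.
def is_moe_text_config(text_config: dict) -> bool:
    # Route on structure (presence of n_routed_experts), not on the
    # model_type string, so future MoE variants are picked up too.
    return "n_routed_experts" in text_config

dense_cfg = {"model_type": "mistral3", "hidden_size": 4096}
moe_cfg = {"model_type": "mistral3", "n_routed_experts": 128}
print(is_moe_text_config(dense_cfg), is_moe_text_config(moe_cfg))  # False True
```

Checking for a structural key means a dense config falls through to the existing Ministral path untouched, while any config that declares routed experts goes to mistral4.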

Notes

  • apply_chat_template works out of the box once the model loads — the raw prompt test output (JSON-formatted) is expected without the chat template and is not a bug
  • reasoning_effort parameter support is intentionally left for a follow-up — this PR focuses on correct inference only
  • Tested on MacBook Pro M5 Max, 128GB unified memory, macOS 26.3

Test plan

  • Mistral Small 4 (119B MoE) loads without weight key errors
  • Generates correct factual output ("What is the capital of Japan?" → Tokyo)
  • 105 tok/s generation, 67GB peak memory
  • Dense Ministral3 routing still works (class instantiation verified)
  • No changes to existing model files other than routing in mistral3.py

Commit message

Adds MoE + MLA model support for Mistral Small 4
(mistralai/Mistral-Small-4-119B-2603), enabling
mlx-community/Mistral-Small-4-119B-2603-4bit to load and run.

New file: mlx_lm/models/mistral4.py
- MoE feedforward with SwitchGLU routing (128 experts, top-4)
- Shared expert support
- MLA attention with compressed KV via explicit kv_b_proj
  (distinct from DeepSeek V3's MultiLinear approach)
- Standard attention fallback for dense layers
- All dimensions read from config, nothing hardcoded

Modified: mlx_lm/models/mistral3.py
- Structural routing: n_routed_experts presence routes to mistral4
- Forward-compatible with future MoE Mistral variants
- Dense Ministral 3B/8B/14B models unaffected

Tested on MacBook Pro M5 Max (128GB):
- 104 tok/s generation
- 67 GB peak memory
- Correct factual output confirmed

Before: ValueError: Received 1260 parameters not in model
After: Model loads and generates correctly

Chat template (apply_chat_template) works out of the box
once the model loads — no additional changes needed.